Goto

Collaborating Authors

 slide image classification


Bi-directionalWeaklySupervisedKnowledge DistillationforWholeSlideImageClassification

Neural Information Processing Systems

An instance feature extractor is shared between theteacher andthestudent tofurther enhance theknowledge exchange between them. In addition, we propose ahard positiveinstance mining strategy based on the output of the student network to force the teacher network to keep mininghardpositiveinstances.


The Rise of AI Language Pathologists: Exploring Two-level Prompt Learning for Few-shot Weakly-supervised Whole Slide Image Classification

Neural Information Processing Systems

This paper introduces the novel concept of few-shot weakly supervised learning for pathology Whole Slide Image (WSI) classification, denoted as FSWC. A solution is proposed based on prompt learning and the utilization of a large language model, GPT-4. Since a WSI is too large and needs to be divided into patches for processing, WSI classification is commonly approached as a Multiple Instance Learning (MIL) problem. In this context, each WSI is considered a bag, and the obtained patches are treated as instances. The objective of FSWC is to classify both bags and instances with only a limited number of labeled bags. Unlike conventional few-shot learning problems, FSWC poses additional challenges due to its weak bag labels within the MIL framework.


Bi-directional Weakly Supervised Knowledge Distillation for Whole Slide Image Classification

Neural Information Processing Systems

Computer-aided pathology diagnosis based on the classification of Whole Slide Image (WSI) plays an important role in clinical practice, and it is often formulated as a weakly-supervised Multiple Instance Learning (MIL) problem. Existing methods solve this problem from either a bag classification or an instance classification perspective. In this paper, we propose an end-to-end weakly supervised knowledge distillation framework (WENO) for WSI classification, which integrates a bag classifier and an instance classifier in a knowledge distillation framework to mutually improve the performance of both classifiers. Specifically, an attention-based bag classifier is used as the teacher network, which is trained with weak bag labels, and an instance classifier is used as the student network, which is trained using the normalized attention scores obtained from the teacher network as soft pseudo labels for the instances in positive bags. An instance feature extractor is shared between the teacher and the student to further enhance the knowledge exchange between them. In addition, we propose a hard positive instance mining strategy based on the output of the student network to force the teacher network to keep mining hard positive instances. WENO is a plug-and-play framework that can be easily applied to any existing attention-based bag classification methods. Extensive experiments on five datasets demonstrate the efficiency of WENO. Code is available at https://github.com/miccaiif/WENO.


TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification

Neural Information Processing Systems

Multiple instance learning (MIL) is a powerful tool to solve the weakly supervised classification in whole slide image (WSI) based pathology diagnosis. However, the current MIL methods are usually based on independent and identical distribution hypothesis, thus neglect the correlation among different instances. To address this problem, we proposed a new framework, called correlated MIL, and provided a proof for convergence. Based on this framework, we devised a Transformer based MIL (TransMIL), which explored both morphological and spatial information. The proposed TransMIL can effectively deal with unbalanced/balanced and binary/multiple classification with great visualization and interpretability. We conducted various experiments for three different computational pathology problems and achieved better performance and faster convergence compared with state-of-the-art methods. The test AUC for the binary tumor classification can be up to 93.09% over CAMELYON16 dataset. And the AUC over the cancer subtypes classification can be up to 96.03% and 98.82% over TCGA-NSCLC dataset and TCGA-RCC dataset, respectively.


GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification

Quang, Ngoc Bui Lam, Binh, Nam Le Nguyen, Nguyen, Thanh-Huy, Nguyen, Le Thien Phuc, Nguyen, Quan, Bagci, Ulas

arXiv.org Artificial Intelligence

Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.


Dynamic Residual Encoding with Slide-Level Contrastive Learning for End-to-End Whole Slide Image Representation

Jin, Jing, Liu, Xu, Gao, Te, Shi, Zhihong, Liang, Yixiong, Zheng, Ruiqing, Kuang, Hulin, Zeng, Min, Kan, Shichao

arXiv.org Artificial Intelligence

Whole Slide Image (WSI) representation is critical for cancer subtyping, cancer recognition and mutation prediction.Training an end-to-end WSI representation model poses significant challenges, as a standard gigapixel slide can contain tens of thousands of image tiles, making it difficult to compute gradients of all tiles in a single mini-batch due to current GPU limitations. To address this challenge, we propose a method of dynamic residual encoding with slide-level contrastive learning (DRE-SLCL) for end-to-end WSI representation. Our approach utilizes a memory bank to store the features of tiles across all WSIs in the dataset. During training, a mini-batch usually contains multiple WSIs. For each WSI in the batch, a subset of tiles is randomly sampled and their features are computed using a tile encoder. Then, additional tile features from the same WSI are selected from the memory bank. The representation of each individual WSI is generated using a residual encoding technique that incorporates both the sampled features and those retrieved from the memory bank. Finally, the slide-level contrastive loss is computed based on the representations and histopathology reports ofthe WSIs within the mini-batch. Experiments conducted over cancer subtyping, cancer recognition, and mutation prediction tasks proved the effectiveness of the proposed DRE-SLCL method.


torchmil: A PyTorch-based library for deep Multiple Instance Learning

Castro-Macías, Francisco M., Sáez-Maldonado, Francisco J., Morales-Álvarez, Pablo, Molina, Rafael

arXiv.org Artificial Intelligence

Multiple Instance Learning (MIL) is a powerful framework for weakly supervised learning, particularly useful when fine-grained annotations are unavailable. Despite growing interest in deep MIL methods, the field lacks standardized tools for model development, evaluation, and comparison, which hinders reproducibility and accessibility. To address this, we present torchmil, an open-source Python library built on PyTorch. torchmil offers a unified, modular, and extensible framework, featuring basic building blocks for MIL models, a standardized data format, and a curated collection of benchmark datasets and models. The library includes comprehensive documentation and tutorials to support both practitioners and researchers. torchmil aims to accelerate progress in MIL and lower the entry barrier for new users. Available at https://torchmil.readthedocs.io.


Hypergraph Mamba for Efficient Whole Slide Image Understanding

Lu, Jiaxuan, Lin, Yuhui, Shi, Junyan, Yan, Fang, Zhou, Dongzhan, Gao, Yue, Wang, Xiaosong

arXiv.org Artificial Intelligence

Whole Slide Images (WSIs) in histopathology pose a significant challenge for extensive medical image analysis due to their ultra-high resolution, massive scale, and intricate spatial relationships. Although existing Multiple Instance Learning (MIL) approaches like Graph Neural Networks (GNNs) and Transformers demonstrate strong instance-level modeling capabilities, they encounter constraints regarding scalability and computational expenses. To overcome these limitations, we introduce the WSI-HGMamba, a novel framework that unifies the high-order relational modeling capabilities of the Hypergraph Neural Networks (HGNNs) with the linear-time sequential modeling efficiency of the State Space Models. At the core of our design is the HGMamba block, which integrates message passing, hypergraph scanning & flattening, and bidirectional state space modeling (Bi-SSM), enabling the model to retain both relational and contextual cues while remaining computationally efficient. Compared to Transformer and Graph Transformer counterparts, WSI-HGMamba achieves superior performance with up to 7* reduction in FLOPs. Extensive experiments on multiple public and private WSI benchmarks demonstrate that our method provides a scalable, accurate, and efficient solution for slide-level understanding, making it a promising backbone for next-generation pathology AI systems.


PTCMIL: Multiple Instance Learning via Prompt Token Clustering for Whole Slide Image Analysis

Zhao, Beidi, Kim, SangMook, Chen, Hao, Zhou, Chen, Gao, Zu-hua, Wang, Gang, Li, Xiaoxiao

arXiv.org Artificial Intelligence

Multiple Instance Learning (MIL) has advanced WSI analysis but struggles with the complexity and heterogeneity of WSIs. Existing MIL methods face challenges in aggregating diverse patch information into robust WSI representations. While ViTs and clustering-based approaches show promise, they are computationally intensive and fail to capture task-specific and slide-specific variability. To address these limitations, we propose PTCMIL, a novel Prompt Token Clustering-based ViT for MIL aggregation. By introducing learnable prompt tokens into the ViT backbone, PTCMIL unifies clustering and prediction tasks in an end-to-end manner. It dynamically aligns clustering with downstream tasks, using projection-based clustering tailored to each WSI, reducing complexity while preserving patch heterogeneity. Through token merging and prototype-based pooling, PTCMIL efficiently captures task-relevant patterns. Extensive experiments on eight datasets demonstrate its superior performance in classification and survival analysis tasks, outperforming state-of-the-art methods. Systematic ablation studies confirm its robustness and strong interpretability. The code is released at https://github.com/ubc-tea/PTCMIL.


GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning

Raju, S M Taslim Uddin, Islam, Md. Milon, Haque, Md Rezwanul, Altaheri, Hamdi, Karray, Fakhri

arXiv.org Artificial Intelligence

Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSI face challenges such as redundant patches and unknown patch positions due to subjective pathologist captures. Moreover, generating automatic pathology captions remains a significant challenge. To address these issues, we introduce a novel GNN-ViTCap framework for classification and caption generation from histopathological microscopic images. First, a visual feature extractor generates patch embeddings. Redundant patches are then removed by dynamically clustering these embeddings using deep embedded clustering and selecting representative patches via a scalar dot attention mechanism. We build a graph by connecting each node to its nearest neighbors in the similarity matrix and apply a graph neural network to capture both local and global context. The aggregated image embeddings are projected into the language model's input space through a linear layer and combined with caption tokens to fine-tune a large language model. We validate our method on the BreakHis and PatchGastric datasets. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning. Experimental results demonstrate that GNN-ViTCap outperforms state of the art approaches, offering a reliable and efficient solution for microscopy based patient diagnosis.